CIV1498 - Introduction to Data Science

Project - Toronto Bike Share

PART II: Exploratory Data Analysis

By: Gneiss Data (Greig Knox and Yoko Yanagimura)

In this section, we will work with the cleaned and merged bike station and weather data derived from Part I. We will explore the dataset to extract insights that will help answer some of the questions posed by the City of Toronto and hopefully also provide some valuable insights determined outside the scope of work.

The list below shows the questions that were posed by the City of Toronto. We aim to answer the following questions through the analyses presented in this notebook:

  1. When are people using the bike share system? How does usage vary across the year, the week, and the day?

  2. Is there a difference in usage behaviour between Casual and Annual Member riders?

  3. Is there an increasing trend of usage from 2017 to 2020 and is the trend the same for both Casual and Annual Member riders?

  4. How popular is FREE RIDE WEDNESDAYS?

  5. How did usage change in 2020 due to the pandemic and government-mandated lockdowns?

  6. How do statutory holidays impact demand?

  7. Which neighbourhoods have seen the largest number of rides depart from bike stations located within their boundaries?

  8. Which neighbourhoods have seen the largest number of rides end at bike stations located within their boundaries?

  9. How does the weather change the way people use the bike share system?

  10. What weather features are most influential (temperature, humidity, precipitation, etc.)?

We have also conducted a series of analyses to provide some of our own insights into the dataset. We have attempted to answer the following questions through our analyses:

  1. Does bike-share usage vary depending on proximity to TTC subway and streetcar stations?

  2. If everyone is travelling along bike paths, which bike paths are the most congested and at what times of the day?

  3. Are there seasonal trends in trip duration?

0. Setup Notebook and Import Data

We will import the cleaned bike trip dataset that will be used for analysis.

We will keep a backup of the trip data in a separate variable, in case we need to reset the applied manipulations.
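As a minimal sketch of the backup-and-reset pattern described above (the toy columns and the name df_trips_backup are assumptions for illustration, not the notebook's actual code):

```python
import pandas as pd

# Toy stand-in for the cleaned trip data; the real notebook loads it from file.
df_trips_data = pd.DataFrame({
    "trip_id": [1, 2, 3],
    "trip_duration": [600, 1200, 300],  # seconds (assumed unit)
})

# Deep copy, so later manipulations of the working frame do not touch the backup.
df_trips_backup = df_trips_data.copy(deep=True)

# Manipulate the working frame...
df_trips_data["trip_duration"] = df_trips_data["trip_duration"] / 60

# ...and restore it from the backup when a reset is needed.
df_trips_data = df_trips_backup.copy(deep=True)
```

Copying again on restore keeps the backup itself untouched, so the reset can be repeated as often as needed.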

1. Understanding the Dataset

Before diving into the analysis, it is important to understand and define the following properties about our dataset.

  1. Structure - What is the format of our data file?

  2. Scope - How complete is our data set?

  3. Granularity - How fine or coarse is each row and column?

Some of these questions were already explored in Part I (Data Cleaning and Wrangling), but we revisit them in more detail in this section.

Structure

The structure of the data file can be examined by looking at its "shape". We can see that there are 8,007,423 rows and 32 columns. This means we have over 8 million trip records in the database from January 2017 to November 2020.

What are the fields (i.e. columns) in each record? What is the type of each column? We can use .columns to list the field names. We can see that there are data about the trip and the associated information about the weather when the trip was taken. We joined the bike trip data and the weather dataframe together in Part I.

There are some redundant columns in the database, for example 'start_station_name' contains the same information as 'start_station_name_npl' except in the latter, the special characters have been removed. This is a relic column from the data wrangling process, but it has been left in here in case we prefer to access the station names without the special characters again.

Otherwise, it can be seen that there is information about the origin and destination of a trip (bike station name and id), their location coordinates, as well as information about the weather collected at the City of Toronto weather station at the time the trip was taken.

Scope

To summarize the data cleaning conducted in Part I, the records removed from the dataset include:

The output below shows the number and percentage of cells with null values (missing data). All trip records contain information about the start and end stations and their respective location coordinates. There are trips missing 'bike_id' and 'subscription_id', but this is because this information was not available for the 2017 and 2018 datasets. Some trip records are missing information about the weather.
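A null-value summary of this kind can be sketched as follows. The toy dataframe and the 'temperature' column name are assumptions; only 'bike_id' comes from the text above.

```python
import numpy as np
import pandas as pd

# Toy data: bike_id is missing for the older records, one weather value is missing.
df = pd.DataFrame({
    "start_station_id": [7000, 7001, 7002, 7003],
    "bike_id": [np.nan, np.nan, 101.0, 102.0],
    "temperature": [5.0, np.nan, 12.0, 20.0],  # hypothetical weather column
})

# Count of nulls per column, and the same expressed as a percentage of rows.
null_summary = pd.DataFrame({
    "null_count": df.isnull().sum(),
    "null_pct": (df.isnull().mean() * 100).round(2),
})
print(null_summary)
```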

We use '.info' to see the datatype of each column contained in the data_merged dataframe.

We can see that 'start_time' and 'end_time' need to be converted into datetime objects again and localized to the EST timezone.

We can see that the 'start_station_id', and 'end_station_id' can be converted into integers.
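The two conversions can be sketched as below. This assumes the timestamps are stored as timezone-naive strings and the station ids as floats with no missing values; the toy rows are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "start_time": ["2019-07-01 08:15:00", "2019-07-01 17:40:00"],
    "end_time": ["2019-07-01 08:32:00", "2019-07-01 18:05:00"],
    "start_station_id": [7000.0, 7021.0],
    "end_station_id": [7021.0, 7000.0],
})

# Parse the strings and attach the EST timezone (assumes naive local timestamps).
for col in ["start_time", "end_time"]:
    df[col] = pd.to_datetime(df[col]).dt.tz_localize("EST")

# Station ids are whole numbers; the cast would fail if any id were missing.
for col in ["start_station_id", "end_station_id"]:
    df[col] = df[col].astype("int64")
```

Note that "EST" is a fixed UTC-5 offset. A region-based zone such as "America/Toronto" would follow daylight saving time instead; which one matches the notebook's data is not something this sketch can confirm.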

Granularity

Granularity refers to how fine or coarse each datum is. The raw data used in our exploratory data analysis, contained in the 'df_trips_data' dataframe, has one trip per row. Throughout the analysis, we will use the '.groupby.agg()' function to change the granularity to obtain, for example, the daily or hourly ride counts.

Below we use the .groupby.agg() function to determine the daily ride count by date. We also broke the trip dataset down by the membership type to determine if the observed trend is the same for both casual and annual member riders.
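A minimal sketch of this change of granularity, from one row per trip to one row per day and membership type. The column names 'trip_id' and 'user_type' and the toy rows are assumptions.

```python
import pandas as pd

trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4, 5],
    "start_time": pd.to_datetime([
        "2019-07-01 08:00", "2019-07-01 09:30", "2019-07-01 18:00",
        "2019-07-02 08:10", "2019-07-02 12:00",
    ]),
    "user_type": ["Annual Member", "Annual Member", "Casual Member",
                  "Annual Member", "Casual Member"],
})

# Truncate each timestamp to midnight so that trips group by calendar day.
trips["date"] = trips["start_time"].dt.normalize()

# One row per (date, membership type) with the number of trips taken.
daily_by_member = (
    trips.groupby(["date", "user_type"])
    .agg(ride_count=("trip_id", "count"))
    .reset_index()
)
print(daily_by_member)
```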

1. When are people using the bike share system? How does usage vary across the year, the week, and the day?

As part of this question, we will determine if there is a pattern in the bike share usage over the year, week and the day.

Figure 1 shows the total ride counts by year in the df_trips_data database. The number of rides increases every year: by 35% from 2017 to 2018, 27% from 2018 to 2019 and 6% from 2019 to 2020. The percentage increase is likely underestimated for 2020 because the data does not include the ride counts from November and December 2020, but the increase was probably still smaller than in previous years.
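The year-over-year percentages are simple relative changes, which pandas can compute with pct_change. The totals below are made-up round numbers chosen only to illustrate the arithmetic; they are not the notebook's ride counts.

```python
import pandas as pd

# Illustrative yearly totals (NOT the real data).
yearly = pd.Series({2017: 1000, 2018: 1350, 2019: 1700, 2020: 1800},
                   name="ride_count")

# (this year - previous year) / previous year, expressed as a percentage.
yoy_pct = (yearly.pct_change() * 100).round(1)
print(yoy_pct)
```

The first year has no predecessor, so its change is NaN. A partial final year (as with 2020 here) biases its percentage downward.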

Then, we will look at the yearly trend by plotting the daily trip count found in df_usage_perd_by_mem against the dayofyear. The dayofyear column returns the day of the year on which the particular date occurs.

In this plot, titled Figure 2A - Daily Ride Counts by Day of Year between 2017 to 2020, we see a recurring peak between Day 180 and Day 270, which corresponds roughly to July through September. However, the large day-to-day fluctuations make the annual pattern hard to decipher.

In the subsequent plot, Figure 2B - Average Daily Ride Counts (with 95% Confidence interval) by Month Between 2017 and 2020, we plot the average daily ride count per month between 2017 and 2020. We can see a distinct peak in the summer around July and August, and usage tends to be lowest in the winter months between December and March.

Now, we will examine the weekly trend by plotting the daily trip count found in df_usage_perd_by_mem against the dayofweek. The dayofweek column returns the day of the week on which the particular date occurs.

In the plot titled Figure 3A - Average Daily Ride Counts by Day of Week, we see that Wednesday appears to be the most popular day in terms of the average daily ride count, while Sunday is the least popular. A similar pattern is observed in Figure 3B - Total Ride Counts by Day of Week, which shows the total ride count by day of the week.
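The difference between the average and the total per day of week can be sketched from a table of daily counts. The toy values below are invented; in pandas, dayofweek numbers Monday as 0 and Sunday as 6.

```python
import pandas as pd

# Toy daily ride counts: two Mondays and two Tuesdays.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2019-07-01", "2019-07-02",
                            "2019-07-08", "2019-07-09"]),
    "ride_count": [100, 120, 140, 160],
})
daily["dayofweek"] = daily["date"].dt.dayofweek  # Monday=0 ... Sunday=6

# Mean = average daily ride count; sum = total ride count for that weekday.
by_dow = daily.groupby("dayofweek")["ride_count"].agg(["mean", "sum"])
print(by_dow)
```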

In the subsequent plot, Figure 4 - Average Daily Ride Counts by Day of Week by Membership Type, we plot the average daily ride count by day of week and membership type. A different trend starts to emerge for each membership type. Annual members tend to use the bikes more on weekdays, while casual members tend to use the bikes more on weekends. However, Wednesday seems to be the most popular weekday for both member types. Because significantly more rides are taken by annual members (they account for 77% of the trips included in df_trips_data), the overall trend observed for all member types is similar to the trend observed for annual members.

The differences between the membership types will be revisited and examined in more detail as part of Question 2 in this notebook.

Now, we will examine the daily bike usage by looking at the distribution of bike rides by the hour of the day.

In this plot, titled Figure 5 - Bike Usage Throughout Day, we are looking at the probability density function of the bike trip counts throughout the day. We can see two peaks: one in the morning between 8:30 AM and 11 AM, and a much bigger one between 5 PM and 8 PM. This suggests that many users rely on the bike share program for commuting, which is why we see major peaks during the morning and evening rush hours. We believe the evening peak is much bigger than the morning peak because most recreational rides also occur in the afternoon and evening. This will become more evident in the subsequent analysis of the differences between annual and casual members.

2. Is there a difference in usage behaviour between Casual and Annual Member riders?

In this question, we will look at the trends that we have begun to identify as part of Question 1, and determine whether these trends hold for both types of members: casual and annual. Annual members are people who pay for their annual pass, which gives the user unlimited 30 minute rides for the year. On the other hand, casual members are people who opt to pay for their bike ride per use.

As part of this question, we explored the following questions:

1) How does the daily ride count differ between casual and annual riders?

2) How does the bike usage differ over the week for casual and annual riders?

3) Is there a change in usage whether it is a weekday or weekend for casual and annual members?

4) Is there a change in usage by time of day for casual and annual members?

5) Is there a difference in trip duration for casual and annual members?

We believe that annual members have obtained their membership because they use the bike share on a regular basis. For this reason, they would take bike trips mainly for commuting, in addition to some recreational trips after work hours or on the weekends. On the other hand, casual members do not rely on the bikes to get from Point A to Point B on a regular basis, so they are less likely to use the bikes for commuting. Because the intentions of annual and casual members are different, we anticipated that usage would differ by membership type.

Based on the calculation below, trips taken by annual members make up the bulk of the trips contained in 'df_trips_data'.

How does the daily ride count differ between casual and annual riders?

Figure 6 - Distribution of Daily Ride Counts for Casual Members and Annual Members from 2017 to 2020 shows the probability density distribution of daily ride counts for annual and casual members. Overall as expected, annual members tend to take more bike trips on any given day compared to the casual riders.

How does the bike usage differ over the week for casual and annual riders?

The difference in bike usage over the week is determined for casual and annual riders by looking at the total number of rides by day of week and membership type. As shown in Figure 7 - Total Ride Counts by Day of Week and Membership Type, annual riders tend to use the bikes more on weekdays while casual riders prefer to use the bikes on the weekend. The results shown here are similar to what was observed in Figure 4.

Is there a change in usage whether it is a weekday or weekend for casual and annual members?

The scatter plot below, Figure 8 - Comparison of Daily Ride Counts for Casual Members and Annual Members on Weekday or Weekend, compares the daily ride count for casual and annual riders on any given date. On weekdays, the proportion of bike rides taken by annual members increases relative to casual riders. On the weekend, the number of bike rides by annual riders decreases proportionally relative to casual riders. This trend was also observed in the bar plot shown in Figure 4.

Considering behaviour on weekdays versus weekends, two trends are noted, one for annual members and one for casual members. Based on the figure below, titled Figure 9 - Average Daily Ride Counts for Weekday or Weekend By Membership Type and Year, annual members show a higher frequency of usage during the week than on the weekends, while casual members display the inverse trend, with higher usage on weekends than on weekdays.

It is interesting to note that the gap between the weekday and weekend usage decreases in 2020 for the annual riders. We think this is partially due to the impact of COVID-19, which caused a lot of people to work from home. This eliminated the need for people to commute on a regular basis. For this reason, the weekday usage for annual members did not go up significantly compared to the previous year, while the usage over the weekend continued to increase. We also see in 2020 that the weekend usage went up significantly for casual members. We speculate that bike share was considered by many people to be a safe recreational activity in the midst of the COVID lockdowns experienced in Toronto.

Is there a change in usage by time of day for casual and annual members?

The graph below, titled Figure 10 - Hourly Distribution of Ride Counts for Casual Members and Annual Members, depicts the hourly ride counts by membership type. For annual members, there are distinct peaks during rush hour, indicating the high demand for bikes during typical commuting times (8-11 AM and 5-8 PM). For casual riders, demand tends to start increasing in the afternoon and peaks in the evening between 6 and 8 PM.

Is there a difference in trip duration for casual and annual members?

The figure below ( Figure 11 - Trip Duration Distribution for Casual Members and Annual Members ) shows the distribution of trip duration for casual and annual members. Casual members tend to take longer trips than annual members. This is likely due to the different intent of annual and casual riders. Annual riders use the bike share to get from Point A to Point B, and they have unlimited 30 minute rides, so they do not feel the need to prolong the ride duration. On the other hand, casual members tend to use the bike share for more recreational purposes and they pay per 30 minute ride, so they may be motivated to prolong the trip for as long as they can without incurring additional fees.

3. Is there an increasing trend of usage from 2017 to 2020 and is the trend the same for both Casual and Annual Member riders?

Is there an increasing trend of usage from 2017 to 2020?

In the first part of the question, we initially looked at a line plot ( Figure 12 - Number of Rides Per Day between 2017 to 2020 ) showing the daily trip count between 2017 and 2020 for all member types. We can clearly see an increasing trend over time as the peak gets bigger each year, but the trend is partly obscured by day-to-day fluctuations in the daily trip count. It is interesting to note that the fluctuations in 2020 were larger than in previous years.

We also looked at the average daily ride count for each year to determine if there are any changes to the average. If there is an increasing trend, there should be a corresponding increase in the average from 2017 to 2020. In Figure 13A - Average Daily Ride Counts by Year, we can see an increase in the average daily ride count, which suggests that there is an increasing trend of usage from 2017 to 2020 on a daily basis. In Figure 13B - Average Monthly Ride Count by Year, there is a large confidence interval associated with the monthly average. This is attributed to the large daily fluctuations in the ride count, but there was also a prolonged period at the start of the pandemic, between March and May, when the ride counts were extremely low.

Comparisons of average daily ride counts by month over the years ( Figure 14 - Average Daily Rides (with 95% Confidence interval) by Month between 2017 to 2020 ) show that there is an increasing trend of usage from 2017 to 2020, particularly between May and October. The trend is not as obvious during the colder months between November and April.

Keep in mind that we did not have any bike trip data for November and December 2020. For this reason, the comparison is not available for November and December of 2020.

Is the trend the same for both Casual and Annual Member riders?

We also looked at a line plot ( Figure 15 - Daily Ride Counts for Annual and Casual Members Between 2017 to 2020 ) showing the daily trip count between 2017 and 2020 for annual and casual members. For both member types, we can clearly see an increasing trend over time as the peak gets bigger each year, but the trends are again partly obscured by day-to-day fluctuations in the daily trip count.

Comparisons of average daily ride counts by month over the years are shown for each member type in Figure 16 and Figure 17.

Based on these graphs, we can see that there is an increase in usage for casual and annual members from 2017 to 2020. When we break down the daily ride count by membership type, however, we can see that the biggest jumps in usage for casual members were between 2018-2019 and 2019-2020. For annual members, the jump in usage was most significant between 2017-2018 and, to a lesser degree, 2018-2019. As such, there are some differences in the trend for annual and casual riders.

We believe that annual riders did not show a big increase between 2019 and 2020, as casual riders did, because of the impact of COVID-19. The uncertainty around COVID-19 in 2020 and the pressure to work from home reduced the number of people who needed to commute to the office. Because the need for commuting was reduced, 2020 was not a big year for annual riders. The impact of COVID-19 will be examined in more detail in Question 5.

4. How popular is FREE RIDE WEDNESDAYS?

The popularity of FREE RIDE WEDNESDAYS can be assessed by looking at the change in the ride count over the course of the week. If it is popular, we should see more rides being taken on Wednesdays.

If the FREE WEDNESDAY promotion was popular, we would anticipate that it would be popular among the casual members, but not necessarily the annual members. The annual members already have unlimited 30 minute bike rides for the year, so they do not need to take advantage of the FREE RIDE WEDNESDAY.

FREE RIDE WEDNESDAY is a promotion offered during one month each year, in which the bike share bikes are free on the Wednesdays of that promo month. According to one article, in 2020 the promotion was only offered in the month of September. Another article suggests that the promotion was offered in August in 2019. In 2018, it was offered in the month of June. In 2017, it was offered in the month of July.

We will narrow our dataset to these specific months for which we know FREE RIDE WEDNESDAY promotion was offered.
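One way to narrow a table of daily counts to the promotion months is to map each year to its promo month and keep only matching rows. The year-to-month mapping follows the months stated above; the toy dates, counts and variable names are assumptions.

```python
import pandas as pd

# Promo month per year, as stated in the text (July, June, August, September).
promo_month_by_year = {2017: 7, 2018: 6, 2019: 8, 2020: 9}

daily = pd.DataFrame({
    "date": pd.to_datetime(["2017-07-05", "2017-08-02", "2018-06-06",
                            "2019-08-07", "2020-09-02", "2020-07-01"]),
    "ride_count": [900, 400, 950, 1000, 800, 500],  # invented values
})

# Keep a row only if its month equals the promo month of its year.
is_promo = daily["date"].dt.month == daily["date"].dt.year.map(promo_month_by_year)
promo_days = daily[is_promo]
print(promo_days)
```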

Figure 18 - Average Daily Rides (with 95% Confidence Interval) by Day of Week and Year for Months with FREE RIDE WEDNESDAY promotion, shown below, demonstrates that FREE RIDE WEDNESDAY promotions do significantly increase the number of rides taken on Wednesdays. For 2017 - 2019, the frequency of rides taken more than doubles compared to other weekdays. For 2020, the frequency does not quite double, but there are significantly more rides taken on the Wednesdays when the promotion is available.

Figure 19 - Average Daily Rides (with 95% Confidence Interval) Comparing Months with and without FREE RIDE WEDNESDAY Promotion compares the average daily ride count for months with and without FREE RIDE WEDNESDAY promotion. It is evident that the promotion does significantly increase the number of rides taken on Wednesdays. When the promotion is available, the trip count is 5x greater compared to when the promotion is not available. When the promotion is available, the ride frequency on Wednesdays exceeds that on the weekends.

5. How did usage change in 2020 due to the pandemic and government-mandated lockdowns?

As part of this question, we need to examine how the bike usage changed in 2020 compared to all the other previous years. We will answer the following questions to determine how the COVID-19 pandemic and government-mandated lockdowns impacted the bike share usage.

  1. Was the increasing trend in bike usage seen over the past years also seen in 2020?

  2. How did the hourly distribution of ride counts change in 2020 compared to previous years?

  3. Has the pandemic impacted the bike trip duration?

  4. Is there an increase in trips where the start and end locations are the same?

Was the increasing trend in bike usage seen over the past years also seen in 2020?

The conclusion above is reinforced by Figure 20C, which shows a big decrease in the total number of rides taken by annual members compared to casual members in 2020. However, this trend was also somewhat observed in 2019. It would be interesting to see whether total annual membership numbers have been increasing or decreasing over the years, to determine whether a declining membership is the cause of this trend.

How did the hourly distribution of ride counts change in 2020 compared to previous years?

Impact of COVID-19 on Hourly Usage:

Has the pandemic impacted the bike trip duration?

Impact of COVID-19 on Trip Duration:

First, to get a sense on how the trip duration changes with the pandemic, we derived the statistics for trip duration using the '.groupby.agg()' function. We determined the mean, mode and median trip duration by year and membership type in the variable trip_dur.
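A sketch of such a summary using named aggregation. The column names and toy durations are assumptions, and the mode is taken as the smallest most-frequent value when there are ties.

```python
import pandas as pd

trips = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "user_type": ["Annual Member"] * 3 + ["Casual Member"] * 3,
    "trip_duration_min": [10, 10, 16, 20, 30, 30],  # invented values
})

# Mean, median and mode of trip duration per (year, membership type).
trip_dur = (
    trips.groupby(["year", "user_type"])["trip_duration_min"]
    .agg(mean="mean", median="median", mode=lambda s: s.mode().iloc[0])
    .reset_index()
)
print(trip_dur)
```

Trip durations are usually right-skewed, so the median is often a more robust summary than the mean, which may be why the median is the statistic plotted afterwards.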

The following plot ( Figure 22 ) shows the median trip duration in minutes by year and membership type.

Is there an increase in trips where the start and end locations are the same?

We speculated that during the pandemic, the recreational use of the bike share increased relative to its use for commuting (or specifically for getting from Point A to Point B). To determine this, we looked at the number of trips by year where the start and end locations were the same. If users are using the bike share for the sole purpose of getting exercise, we would anticipate an increase in the number of trips where the start and end locations are the same.
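Counting such round trips can be sketched as below, using a boolean flag and a per-year aggregation. The toy rows are invented, and reporting the share alongside the count is our own suggestion rather than something the notebook is known to do.

```python
import pandas as pd

trips = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "start_station_id": [7000, 7001, 7002, 7000, 7001, 7002],
    "end_station_id": [7001, 7001, 7000, 7000, 7001, 7005],
})

# True when a trip ends at the station where it started.
trips["same_station"] = trips["start_station_id"] == trips["end_station_id"]

# Count of such trips per year, and their share of all trips that year.
same_station_by_year = trips.groupby("year")["same_station"].agg(
    count="sum", share="mean"
)
print(same_station_by_year)
```

Because total ridership also grew over the years, the share helps separate a real shift in behaviour from overall growth.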

As seen in Figure 24 - Number of Rides With Same Start and End Stations By Year, there is a significant increase in the number of trips with the same start and end stations. This suggests that in 2020 more people, both casual and annual members, began to use the bike share program for exercise rather than for getting from Point A to Point B.

6. How do statutory holidays impact demand?

To determine how statutory holidays impact demand, we first generated a list of Canadian holidays. Then, we subset df_trips_data to the trips that occurred on statutory holidays and used the data to determine the hourly usage.
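The subsetting step can be sketched as follows. The holiday list here is a hand-typed two-date stand-in (Canada Day and Christmas Day 2019) used only for illustration; the notebook's actual list and how it was generated are not shown in the text.

```python
import pandas as pd

# Hand-listed holiday dates for illustration only.
holiday_dates = pd.to_datetime(["2019-07-01", "2019-12-25"])

trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-07-01 10:05", "2019-07-01 10:40", "2019-07-01 15:20",
        "2019-07-02 08:10", "2019-12-25 13:00",
    ]),
})
trips["hour"] = trips["start_time"].dt.hour

# A trip is a holiday trip if its calendar date is in the holiday list.
trips["is_holiday"] = trips["start_time"].dt.normalize().isin(holiday_dates)

# Hourly ride counts on holidays only.
holiday_hourly = trips[trips["is_holiday"]].groupby("hour").size()
print(holiday_hourly)
```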

Figure 24 shows the hourly distribution of ride counts comparing usage on statutory holidays and weekdays. Figure 25 shows the hourly distribution of ride counts comparing usage on statutory holidays and weekends. Figure 26 shows the hourly distribution of ride counts comparing usage on statutory holidays and weekends by membership type.

From the analysis, it was determined that on statutory holidays, the hourly usage is similar to that of weekends for both annual and casual members.

7. Which neighbourhoods have seen the largest number of rides depart from bike stations located within their boundaries?

8. Which neighbourhoods have seen the largest number of rides end at bike stations located within their boundaries?

Questions 7 and 8 will be answered together below. As part of these questions, we conducted the following analysis:

  1. Import a map of the neighbourhoods
  2. Determine which bike stations are in which neighbourhood
  3. Count rides starting or ending in each neighbourhood
  4. Generate a heat map based on the number of rides originating or ending in each neighbourhood

The bikeshare_stations GeoDataFrame does not contain crs information because we constructed it ourselves from (lat, lon) coordinates. However, we know from publicbikesystem.net that the station locations have the same crs as the neighbourhoods.

So, the crs of bike stations df_stations was set to EPSG:4326.

Initially, the neighbourhood analysis was conducted by year as we were not sure if the most popular start and end location of bike trips changes from year to year.

Based on the analysis above, it appears that the most popular neighbourhood for start and end of trips is Waterfront Communities-The Island . This does not change from year to year.

As such moving forward with our analysis, the dataset was treated as a single dataset (without breaking it out by year) to determine the neighbourhoods with the largest number of rides departing and ending within their boundaries.

In the analysis below, we did a check to see how many stations were located in each neighbourhood in Toronto.

It was determined that Waterfront Communities-The Island had the largest number of stations. Since this neighbourhood has the most stations, it could be expected that it would also have the greatest number of rides beginning and ending within it.

In the cell below, we determine the number of rides originating and ending at each station. Then, the values are merged with df_stations, which is the geopandas dataframe containing the geometry (point) of each bike station in Toronto. Using this, the ride count at each station is added to a neighbourhood's total in the df_neighbourhoods dataframe if the station falls within that neighbourhood's boundaries. Simultaneously, the percentage of total rides starting and ending in each neighbourhood is calculated. Finally, df_neighbourhoods is subset to only contain neighbourhoods with at least one trip originating or ending within their bounds, and saved to df_neighbourhoods_map. This filter removes all neighbourhoods without a bike station within their boundaries.
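The aggregation logic can be sketched with plain pandas, under the assumption that each station has already been assigned to a neighbourhood (in the notebook this assignment comes from the station and neighbourhood geometries). All names below other than Waterfront Communities-The Island are placeholders.

```python
import pandas as pd

# Assumed result of the spatial step: one neighbourhood label per station.
stations = pd.DataFrame({
    "station_id": [7000, 7001, 7002, 7003],
    "neighbourhood": ["Waterfront Communities-The Island",
                      "Waterfront Communities-The Island",
                      "Neighbourhood B", "Neighbourhood C"],
})
trips = pd.DataFrame({
    "start_station_id": [7000, 7000, 7001, 7002],
    "end_station_id": [7001, 7002, 7000, 7000],
})

# Rides starting and ending at each station (0 for stations with no rides).
stations["start_rides"] = (
    stations["station_id"].map(trips["start_station_id"].value_counts()).fillna(0)
)
stations["end_rides"] = (
    stations["station_id"].map(trips["end_station_id"].value_counts()).fillna(0)
)

# Sum the station counts within each neighbourhood and compute percentages.
by_nbhd = stations.groupby("neighbourhood")[["start_rides", "end_rides"]].sum()
by_nbhd["start_pct"] = by_nbhd["start_rides"] / by_nbhd["start_rides"].sum() * 100
by_nbhd["end_pct"] = by_nbhd["end_rides"] / by_nbhd["end_rides"].sum() * 100

# Keep only neighbourhoods with at least one ride starting or ending there.
by_nbhd_map = by_nbhd[(by_nbhd["start_rides"] > 0) | (by_nbhd["end_rides"] > 0)]
print(by_nbhd_map)
```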

Shown below are the top 5 neighbourhoods with the largest number of rides originating from within their boundaries.

Shown below are the top 5 neighbourhoods with the largest number of rides terminating within their boundaries.

In the map below, we have plotted a choropleth map of Toronto showing the neighbourhoods with the largest number of rides departing from bike stations located within their boundaries.

In the map below, we have plotted a choropleth map of Toronto showing the neighbourhoods with the largest number of rides ending at bike stations located within their boundaries.

It is interesting to note that if the number of rides departing or terminating in a neighbourhood is normalized by the area of the neighbourhood, we obtain slightly different results. In the maps below, we created two columns with the number of rides normalized by the area of each neighbourhood. When we plot the result, some neighbourhoods begin to stand out that were not prominent in the previous maps.

The map below shows the top neighbourhoods by number of rides originating in the neighbourhood, normalized by neighbourhood area.

The map below shows the top neighbourhoods by number of rides terminating in the neighbourhood, normalized by neighbourhood area.

9. How does the weather change the way people use the bike share system?

10. What weather features are most influential (temperature, humidity, precipitation, etc.)?

As part of Questions 9 and 10, we will explore the following questions:

  1. What kind of weather and temperature conditions are most of the bike trips taken in?

  2. How does temperature, relative humidity and weather conditions impact the trip duration?

  3. How does the relative humidity, temperature and weather condition affect the hourly and daily ride count?

What kind of weather and temperature conditions are most of the bike trips taken in?

Bike rides tend to be most popular when the temperature is between 15 and 27 degrees Celsius, based on Figure 28 - Number of Bike Trips as a Function of Temperature (2017 - 2020). A smaller number of people continue to use the bike share program between 0 and 15 degrees Celsius. Outside these temperature ranges, the number of users who use the bike share system is very limited.

A similar analysis of the relative humidity, shown in Figure 29 - Number of Bike Trips as a Function of Relative Humidity (2017 - 2020), also suggests that there is perhaps an optimal humidity range between 65% and 85% in which riders prefer to go riding.

The weather descriptions in the 'weather' column found in df_trips_data were simplified into the dominant types. For example, 'Thunderstorms,Rain,Fog' was simplified into 'Thunderstorm' and 'Snow,Blowing_Snow' into 'Snow', based on the first weather phenomenon included in the description. This information is stored in the column called category.
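The simplification rule can be sketched as taking the first comma-separated token of each description. The small rename table (plural to singular) is an assumption inferred from the 'Thunderstorm' example above.

```python
import pandas as pd

weather = pd.Series(["Thunderstorms,Rain,Fog", "Snow,Blowing_Snow",
                     "Rain", "Rain,Fog"])

# Keep only the first phenomenon listed in each description.
first = weather.str.split(",").str[0]

# Assumed tidy-up so the label matches the example given in the text.
category = first.replace({"Thunderstorms": "Thunderstorm"})
print(category.tolist())
```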

The proportion of bike rides taken in different weather conditions is shown below in Figure 30 - Percentage of Bike Trips in Database as Function of Weather Condition. The data shows that poor weather conditions strongly deter users from using the bikes, given that over 92% of bike trips in the entire database were taken when the weather was clear.

Moving forward, it will be valuable to distinguish between good weather (clear day) and poor weather (all other weather phenomena). To achieve this, we create a new column in df_trips_data called weather2 where 'Clear day' is called 'Clear' and all other weather phenomena are called 'Precipitation'. Doing so, we can reduce the weather condition to two categories, as shown in Figure 31 - Total Bike Rides from 2017 to 2020 as a Function of Weather Condition.
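The two-way split can be sketched with numpy.where. The toy category values are invented; the labels 'Clear day', 'Clear' and 'Precipitation' follow the text.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["Clear day", "Rain", "Snow", "Clear day", "Thunderstorm"],
})

# 'Clear day' becomes 'Clear'; every other phenomenon becomes 'Precipitation'.
df["weather2"] = np.where(df["category"] == "Clear day", "Clear", "Precipitation")

# Share of trips (rows) in each of the two categories, in percent.
share = df["weather2"].value_counts(normalize=True) * 100
print(share)
```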

In Figure 32 - Comparison of Temperature and Relative Humidity for Trips Taken by Annual and Casual Members , the temperature and relative humidity of each trip was plotted on a kde plot by membership type. This reveals something very interesting about the rider preference. The annual riders tend to be less fussy about the temperature and humidity condition compared to casual riders when making decisions about taking a ride.

How does temperature, relative humidity and weather conditions impact the trip duration?

Figure 33A/B - The Effect of Weather on the Distribution of Trip Duration uses a violin plot to examine the influence of weather condition on the trip duration. Although the trip duration appears to be slightly higher on clear days, on average the weather condition doesn't seem to significantly impact the duration of trips. Even in bad weather, once riders are committed to taking the ride, they do not seem to terminate the trip early or attempt to speed to their next destination.

The figure below, Figure 34 - The Effect of Temperature on the Distribution of Trip Duration, is a violin plot showing the distribution of trip duration as a function of temperature. It shows that as the temperature increases, there is a larger proportion of longer trips.

Figure 35 - The Effect of Relative Humidity on the Distribution of Trip Duration looks at the influence of relative humidity on the trip duration. Unlike temperature, the relative humidity does not seem to have a significant impact on the trip duration.

How does the relative humidity, temperature and weather condition affect the hourly and daily ride count?

To further investigate the impact of weather condition on the bike usage, we use the groupby.agg() function to determine:

This information is saved in the variable called hourly_rides_and_weather.

Figure 36 - The Hourly Ride Count as a Function of Weather Condition shows the distribution of the hourly ride count as a function of weather condition and membership type. On average, there are significantly more riders in a given hour when the weather is clear. Nevertheless, the boxplots show significant fluctuations, and hence outliers, in the data. Some of the outliers have been cut from the plot to show the boxes at an appropriate scale.

Figure 37 - Effect of Wind Speed on Hourly Ride Counts shows the effect of wind speed on the hourly ride counts. There is a strong negative correlation between wind speed and ride counts for both annual and casual members.

To investigate the impact of weather condition on the daily bike usage, we use the groupby.agg() function to determine:

This information is saved in the variable called daily_rides_and_weather.

Figure 38 - Effect of Weather Condition on Daily Ride Counts shows the distribution of daily ride counts by weather condition. Daily ride counts are typically very low when more than 50% of the day has poor weather. On clear days (days on which more than 50% of the day is clear), the daily ride counts are much higher.

Figure 39 - Effect of Temperature on Daily Ride Counts shows the daily ride counts as a function of temperature. For both annual and casual members, the daily ride counts tend to increase as the temperature increases, with a peak around 25-27 degrees Celsius. It is interesting to note that the riders who use the bike share below 0 degrees Celsius are predominantly annual members.

Figure 40 - Effect of Relative Humidity on Daily Ride Counts shows the daily ride counts as a function of relative humidity. The correlation between the daily ride count and relative humidity is much weaker than that observed for temperature, although there is a slight positive correlation for both annual and casual members.
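The daily classification can be sketched as below: each day is labelled by the share of its hours that were clear. The column names and data are made up, and treating strictly more than 50% clear hours as a clear day is an assumption; the notebook may handle the boundary differently.

```python
import pandas as pd

# Made-up hourly records for two days; column names are hypothetical.
hourly = pd.DataFrame({
    "date": ["2019-07-01"] * 4 + ["2019-07-02"] * 4,
    "weather2": ["Clear", "Clear", "Clear", "Precipitation",
                 "Precipitation", "Precipitation", "Precipitation", "Clear"],
    "ride_count": [120, 150, 130, 40, 20, 15, 25, 60],
})

# Total rides per day and the fraction of recorded hours that were clear.
daily_rides_and_weather = (
    hourly.assign(is_clear=hourly["weather2"].eq("Clear"))
    .groupby("date")
    .agg(ride_count=("ride_count", "sum"), clear_share=("is_clear", "mean"))
    .reset_index()
)

# Assumption: a day counts as 'Clear' when more than half of its hours are clear.
daily_rides_and_weather["weather2"] = (
    daily_rides_and_weather["clear_share"]
    .gt(0.5)
    .map({True: "Clear", False: "Precipitation"})
)
```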

Figure 41 - The Daily Ride Count as a Function of Member Type and Weather Condition shows the distribution of the daily ride count as a function of weather condition and membership type. The graph demonstrates that weather condition is a very strong predictor of the daily ride count for both annual and casual riders.

What weather features are most influential (temperature, humidity, precipitation, etc.)?

From the above analyses, we determined that wind speed, temperature and precipitation (gauged through the weather condition) are the strongest predictors of bike usage.

Below, we have conducted a bivariate kde analysis for temperature and wind speed. It shows that while annual members are not as fussy as the casual members about the temperature conditions, both member types are similar in their preference to not ride in high wind speeds.

1. Does bike-share usage vary depending on proximity to TTC subway and streetcar stations?

As part of this question, we first determine which bike stations are within 200 m of a subway station. This information is then used to split the data into two groups: trips that start or end near a subway station, and trips that neither start nor end near a subway station.

Then, we will examine the differences in the hourly and weekly usage for these two groups of trips.

We also tried to incorporate the impact of having a streetcar route nearby, but the shp file for streetcar routes was not available.

We want to see which bike stations are within 200 meters of a subway station, and the simplest way to do this is to create a buffer. The 'geometry' column in a GeoDataFrame has a method called .buffer(), which takes a radius argument in the units of the crs (meters in the case of EPSG:26917).

We create a new variable called subway_stations_buffer and set it equal to subway_stations with a 200 meter buffer applied.

Now we want to test if a bike station is within 200 meters of a subway station. First we collapse all of the subway station buffer POLYGON into a MULTIPOLYGON object. We do this with the unary_union attribute.
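The buffer-and-test idea can be sketched with plain shapely geometries, which is what GeoPandas uses underneath. The coordinates and station names below are made up and are assumed to be in a projected crs with meter units; the notebook itself works on GeoDataFrames rather than on bare points.

```python
from shapely.geometry import Point
from shapely.ops import unary_union

# Hypothetical subway and bike station locations, in meters.
subway_points = [Point(0, 0), Point(1000, 0)]
bike_points = {
    "bike_a": Point(150, 0),     # 150 m from the first subway station
    "bike_b": Point(500, 0),     # 500 m from both subway stations
    "bike_c": Point(1000, 120),  # 120 m from the second subway station
}

# A 200 m buffer turns each subway point into a circular POLYGON.
subway_buffers = [p.buffer(200) for p in subway_points]

# Collapse all buffers into a single geometry (a MULTIPOLYGON here,
# because the two circles do not overlap).
subway_zone = unary_union(subway_buffers)

# A bike station is near the subway if it falls inside the merged zone.
near_subway = {name: pt.within(subway_zone) for name, pt in bike_points.items()}
```

Merging the buffers first means each bike station is tested against one geometry instead of against every subway buffer in turn.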

Create a new column in subway_stations called 'bike_access' and assign it a boolean value: True if the station is within 200 meters of a bike station and False if it is not.

Create two variables: near_station_trips contains trips with a start or end station near a subway station, and far_station_trips contains trips where both the start and end stations are far from a subway station.
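The split into the two groups can be sketched as a pair of boolean masks. The station IDs, the column names (start_station_id, end_station_id) and the set near_subway_ids are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical IDs of bike stations flagged as near a subway station.
near_subway_ids = {7001, 7003}

# Made-up trips; column names are assumptions.
trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "start_station_id": [7001, 7002, 7004, 7002],
    "end_station_id": [7002, 7003, 7005, 7004],
})

start_near = trips["start_station_id"].isin(near_subway_ids)
end_near = trips["end_station_id"].isin(near_subway_ids)

# Near: the trip starts OR ends near a subway station.
near_station_trips = trips[start_near | end_near]

# Far: BOTH the start and the end are away from any subway station.
far_station_trips = trips[~(start_near | end_near)]
```

Because the second mask is the exact negation of the first, every trip lands in exactly one of the two groups.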

Analysis of the hourly usage shown in Figure 42 - Hourly Distribution by Membership Type of Ride Counts for Bike Routes Far and Near Subway Station does not show any differences in the hourly usage between routes that have at least a start or end location near a subway and routes that are not near the subway.

The analysis of bike usage by day of week, Figure 43 - Ride Counts by Day of Week for Bike Trips Far and Near Subway Station, shows that usage is higher on weekdays for routes near the subway, but the same trend is not observed for bike trips that are not near subway stations.

We use the groupby.agg() function to determine the daily ride count for routes far from and near subway stations. Figure 44 - Average Daily Rides (with 95% Confidence Interval) for Day of Week and Bike Route in Near and Far from Subway Station shows the same trend as observed in the total count by day of week (Figure 43). For routes that are near a subway station, the average number of rides is higher during the week. This is not the case for routes that are not close to subway stations.

Comparing bike usage over the years for routes far from and near the subway reveals an interesting trend. Figure 45 - Daily Ride Counts by Day of Year between 2017 to 2020 for Bike Routes Near and Far from Subway Station shows that the number of bike rides far from the subway seems to be increasing more rapidly over the years than the number of rides near subway stations. This is more clearly visible in the monthly ride count shown in Figure 46 - Monthly Ride Counts by Day of Year between 2017 to 2020 for Bike Routes Near and Far from Subway Station.
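An average with a 95% confidence interval per day of week can be sketched as below. This uses a normal approximation (mean plus or minus 1.96 standard errors) on made-up daily counts; it is an assumption for illustration and not necessarily how the interval in the figure was computed, since plotting libraries often use bootstrapping instead.

```python
import pandas as pd

# Made-up daily ride counts for two days of the week.
daily = pd.DataFrame({
    "day_of_week": ["Mon"] * 4 + ["Sat"] * 4,
    "ride_count": [100, 110, 90, 100, 60, 80, 70, 50],
})

# Mean and standard error of the mean for each day of the week.
summary = daily.groupby("day_of_week")["ride_count"].agg(["mean", "sem"])

# Normal-approximation 95% confidence interval around each mean.
summary["ci_low"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci_high"] = summary["mean"] + 1.96 * summary["sem"]
```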

2. If everyone is travelling along bike paths, which bike paths are the most congested and at what times of the day?

The City of Toronto publishes data about its current Bikeway Network, which we import from the bike lane file 'bikeway_network.shp'. As part of this analysis, we determine which bike paths are most congested, assuming that everyone travels along bike paths. Once the most congested bike paths are identified, we look at the hourly usage along those routes to determine at what times of the day the lanes are busiest.

In this analysis we conduct network analysis using the networkx Python module. Network analysis requires a routable network. The OSMnx package makes it easy to retrieve routable networks from OpenStreetMap for different transport modes (walking, cycling and driving). OSMnx also wraps some functionality from networkx, which makes routing along OpenStreetMap data straightforward.
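The routing idea can be sketched on a toy graph with networkx alone: find the shortest path for each popular route, then add that route's ride count to every edge on the path. The nodes, edge lengths and ride counts below are made up; in the notebook the graph comes from OpenStreetMap via OSMnx rather than being built by hand.

```python
from collections import Counter

import networkx as nx

# Toy network: nodes stand in for intersections, 'length' is in meters.
G = nx.Graph()
G.add_edge("A", "B", length=300)
G.add_edge("B", "C", length=400)
G.add_edge("A", "C", length=900)
G.add_edge("C", "D", length=200)

# Hypothetical popular routes as (origin, destination, number of rides).
popular_routes = [("A", "C", 120), ("B", "D", 80), ("A", "D", 50)]

# Accumulate ride counts on every edge of each route's shortest path.
edge_load = Counter()
for origin, destination, n_rides in popular_routes:
    path = nx.shortest_path(G, origin, destination, weight="length")
    for u, v in zip(path[:-1], path[1:]):
        edge_load[tuple(sorted((u, v)))] += n_rides

# The edge carrying the most rides is the most congested segment.
busiest_edge, busiest_load = edge_load.most_common(1)[0]
```

Here the direct A-C edge is never used because the detour through B is shorter, so the B-C segment ends up carrying all three routes.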

Based on our observation of the shortest paths for the top 10 most popular routes (shown in blue on the above map), we have identified the following bike lanes as the most popular among bike share users, assuming that they always take a bike lane for their routes.

The most popular bike lanes were identified using a buffer around the shortest routes. The bike lane on Commissioners Street is outside this 400 m buffer zone, but it has been included in the list due to its proximity to bike stations near Cherry Beach.

Figure 46 - Hourly Distribution of Ride Counts on Popular Bike Routes shows the distribution of rides on an hourly basis for rides along the most popular bike routes. From this we can infer when the popular bike lanes identified above will be the most congested. The most popular time of day is between 5 PM and 8 PM on weekdays, and between 2 PM and 8 PM on weekends. One can expect the bike lanes around these routes to be most congested during these time periods.

Figure 47 - Average Monthly Ride Counts between 2017 to 2020 on Top 10 Most Popular Bike Routes shows the average monthly ride count over the course of the year for the past 4 years. The most popular months for biking are June to September. For this reason, one can anticipate that congestion on the bike lanes is most likely during these summer months.

3. Are there seasonal trends in the trip duration?

To investigate whether there are seasonal trends in the trip duration, we first determined the daily average and plotted it on a time series line plot, as shown in Figure 46 - Daily Mean Trip Duration from 2017 to 2020. The graph shows that there is a seasonal trend in the trip duration, which increases over the summer months and decreases in the winter months. It is interesting to note that the peak is higher in 2020, and we speculate that this was at least partially due to the COVID-19 pandemic, as more people began to use the bikes for recreational purposes.

As Figure 47 - Monthly Mean Trip Duration from 2017 to 2020 shows, the average trip duration typically peaks between June and August, at the height of summer, and begins to decrease once the weather gets colder. The average trip duration was exceptionally high in 2020.
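The daily and monthly averages described above can be sketched with a groupby on the trip start time. The column names (start_time, duration_min) and the values are made up for illustration.

```python
import pandas as pd

# Made-up trips; column names are hypothetical.
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-01-15 08:00", "2019-01-15 17:30",
        "2019-07-15 09:00", "2019-07-15 18:00", "2019-07-16 12:00",
    ]),
    "duration_min": [8.0, 10.0, 18.0, 22.0, 25.0],
})

# Mean trip duration for each calendar day.
daily_mean = trips.groupby(trips["start_time"].dt.date)["duration_min"].mean()

# Mean trip duration for each month of the year (1 = January).
monthly_mean = trips.groupby(trips["start_time"].dt.month)["duration_min"].mean()
```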

Looking at the relationship between the max daily temperature and the mean daily trip duration as shown in Figure 48 - Correlation between Temperature and Trip Duration, there is a positive correlation between the two parameters. The variation in the trip duration also seems to increase as the temperature increases.
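The strength of this relationship can be summarised with a Pearson correlation coefficient. The sketch below uses made-up daily values and hypothetical column names; it illustrates the calculation only and does not reproduce the notebook's data or result.

```python
import pandas as pd

# Made-up daily summaries; column names are assumptions.
daily = pd.DataFrame({
    "max_temp_c": [-5.0, 2.0, 10.0, 18.0, 24.0, 29.0],
    "mean_duration_min": [9.5, 10.0, 12.5, 15.0, 18.5, 19.0],
})

# Pearson correlation between daily max temperature and mean trip duration.
pearson_r = daily["max_temp_c"].corr(daily["mean_duration_min"])
```

A value close to 1 indicates a strong positive linear relationship, and a value near 0 indicates little linear association.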

This concludes our exploratory data analysis for the Toronto Bike Share Data.